1.a.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
house_data <- read_csv("kc_house_data.csv")
## Rows: 21613 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): id
## dbl  (19): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterf...
## dttm  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(house_data)
## spc_tbl_ [21,613 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id           : chr [1:21613] "7129300520" "6414100192" "5631500400" "2487200875" ...
##  $ date         : POSIXct[1:21613], format: "2014-10-13" "2014-12-09" ...
##  $ price        : num [1:21613] 221900 538000 180000 604000 510000 ...
##  $ bedrooms     : num [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : num [1:21613] 1180 2570 770 1960 1680 ...
##  $ sqft_lot     : num [1:21613] 5650 7242 10000 5000 8080 ...
##  $ floors       : num [1:21613] 1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num [1:21613] 3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : num [1:21613] 7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : num [1:21613] 1180 2170 770 1050 1680 ...
##  $ sqft_basement: num [1:21613] 0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : num [1:21613] 1955 1951 1933 1965 1987 ...
##  $ yr_renovated : num [1:21613] 0 1991 0 0 0 ...
##  $ zipcode      : num [1:21613] 98178 98125 98028 98136 98074 ...
##  $ lat          : num [1:21613] 47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num [1:21613] -122 -122 -122 -122 -122 ...
##  $ sqft_living15: num [1:21613] 1340 1690 2720 1360 1800 ...
##  $ sqft_lot15   : num [1:21613] 5650 7639 8062 5000 7503 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_character(),
##   ..   date = col_datetime(format = ""),
##   ..   price = col_double(),
##   ..   bedrooms = col_double(),
##   ..   bathrooms = col_double(),
##   ..   sqft_living = col_double(),
##   ..   sqft_lot = col_double(),
##   ..   floors = col_double(),
##   ..   waterfront = col_double(),
##   ..   view = col_double(),
##   ..   condition = col_double(),
##   ..   grade = col_double(),
##   ..   sqft_above = col_double(),
##   ..   sqft_basement = col_double(),
##   ..   yr_built = col_double(),
##   ..   yr_renovated = col_double(),
##   ..   zipcode = col_double(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   sqft_living15 = col_double(),
##   ..   sqft_lot15 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

For readability, sqft_living is divided by 1,000 on the x-axis of the plot below; multiply the plotted x values by 1,000 to recover square feet.

ggplot(house_data, aes(x=sqft_living/1000, y=price)) + geom_point(shape=15,col="darkred")

The scatter plot of price against living area shows only a weak linear relationship. One outlier stands out: even though its living area exceeds 10,000 sqft, its price is only between 2,000,000 and 3,000,000. Most points are densely clustered where the living area is below 5,000 sqft and the price is below 2,000,000, and the points become sparse at prices beyond 2,000,000.

1.b.

The following scatter plot applies a log transformation to the price variable, with sqft_living left untransformed. The relationship is closer to linear, although some curvature remains as we move along the x-axis.

ggplot(house_data, aes(x=sqft_living, y=log(price))) + geom_point(shape=15,col="darkred")

1.c.

correlation_coefficient <- cor(log(house_data$price), house_data$sqft_living)
print(correlation_coefficient)
## [1] 0.6953406

The correlation coefficient of about 0.70 indicates a moderately strong, positive linear association between sqft_living and log(price): larger living areas tend to go with higher prices. Note that a correlation is not a slope, so it does not mean each unit increase in sqft_living raises the price by 0.69 units.
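As a quick check (a hedged sketch reusing the house_data object loaded above, not part of the assignment output), we can compare the correlation on the raw price scale with the log scale to see how much the transformation changes the picture:

```r
# cor() measures the strength of linear association, not a regression
# slope; comparing raw vs. log(price) shows the effect of the transform
cor(house_data$price, house_data$sqft_living)        # raw price scale
cor(log(house_data$price), house_data$sqft_living)   # log(price) scale
```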

1.d.i.

The following code fits a linear model with log-transformed price as the response and sqft_living as the explanatory variable.

fittedlm_log <- lm(log(house_data$price) ~ house_data$sqft_living, data = house_data)
summary(fittedlm_log)
## 
## Call:
## lm(formula = log(house_data$price) ~ house_data$sqft_living, 
##     data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.97781 -0.28543  0.01472  0.26070  1.27628 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.222e+01  6.374e-03  1916.9   <2e-16 ***
## house_data$sqft_living 3.987e-04  2.803e-06   142.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3785 on 21611 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4835 
## F-statistic: 2.023e+04 on 1 and 21611 DF,  p-value: < 2.2e-16

From these results, the intercept is 1.222e+01, the expected log price when sqft_living is 0, and it is statistically significant with a p-value below 2e-16. The slope of 3.987e-04 means each additional square foot of living area increases the expected log price by about 0.0004. The multiple R-squared of 0.4835 means sqft_living explains about 48% of the variance in log(price); the adjusted R-squared is essentially the same, which is a modest fit. The F-statistic of 2.023e+04 with a p-value below 2.2e-16 indicates the model is statistically significant, so we reject the null hypothesis that the slope equals 0.
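A convenient way to read the log-linear slope (a sketch using the estimate 3.987e-04 reported above) is as an approximate percentage effect on price:

```r
# in a log-linear model, a change of d in x multiplies the expected
# response by exp(b * d); b is the slope from the summary above
b <- 3.987e-04
pct_per_100sqft <- (exp(100 * b) - 1) * 100
round(pct_per_100sqft, 1)   # about 4.1% higher price per extra 100 sqft
```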

1.d.ii.

The code below fits a linear model with untransformed price as the response and sqft_living as the explanatory variable.

fittedlm <- lm(house_data$price ~ house_data$sqft_living, data = house_data)
summary(fittedlm)
## 
## Call:
## lm(formula = house_data$price ~ house_data$sqft_living, data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1476062  -147486   -24043   106182  4362067 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -43580.743   4402.690  -9.899   <2e-16 ***
## house_data$sqft_living    280.624      1.936 144.920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261500 on 21611 degrees of freedom
## Multiple R-squared:  0.4929, Adjusted R-squared:  0.4928 
## F-statistic: 2.1e+04 on 1 and 21611 DF,  p-value: < 2.2e-16

From these results, the intercept is -43,580.74, the expected price when sqft_living is 0; it is not meaningful on its own, but it is statistically significant with a p-value below 2e-16. The slope of 280.62 means each additional square foot of living area adds about 281 to the expected price. The multiple R-squared of 0.4929 means sqft_living explains about 49% of the variance in price; the adjusted R-squared is essentially the same, which is a modest fit. The F-statistic of 2.1e+04 with a p-value below 2.2e-16 indicates the model is statistically significant, so we reject the null hypothesis that the slope equals 0.

1.e.

First we fit the model, then extract the residuals from the fitted model and plot them around the zero line to check the constant-variance and normality assumptions.

The following code produces a normal Q-Q plot to check the residuals for normality and a studentized-residuals plot to check the variance assumption graphically.

library(e1071)
res <- residuals(fittedlm)

car::qqPlot(res, main = NA, pch = 19, col = 2, cex = 0.7)

## [1] 7253 3915
skew <- skewness(res)
cat("skewness of the residuals -> ", skew)
## skewness of the residuals ->  2.823452
boxplot(res, main = "Box Plot of Residuals", ylab = "Residuals")

# scale the residuals by the residual standard error to get approximate
# studentized residuals (rstudent() computes the exact version)
sigma_residuals <- sigma(fittedlm)
studentized_residuals <- res / sigma_residuals


plot(fitted(fittedlm), studentized_residuals, xlab = "Fitted Values", ylab = "Studentized Residuals", 
     main = "Studentized Residuals vs. Fitted Values")
abline(h = 0, col = "red")

We cannot run the Shapiro-Wilk test on these residuals because the sample size exceeds 5,000, the test's upper limit, so for larger data sets it is advisable to rely on graphical checks. The normal Q-Q plot shows the residuals deviating from the reference line, indicating that they are not normally distributed. In the studentized-residuals-vs-fitted-values plot, most points lie within the range -3 to 3, which would suggest a reasonable fit, but several outliers exceed that range. The normal Q-Q plot of the studentized residuals below shows the same deviation from the line, so the normality and constant-variance assumptions are not satisfied.

car::qqPlot(studentized_residuals, main = NA, pch = 19, col = 2, cex = 0.7)

## [1] 7253 3915
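If a formal test is still wanted, one common workaround (a hedged sketch reusing the fittedlm object above, not part of the original output) is to apply the Shapiro-Wilk test to a random subsample of at most 5,000 residuals:

```r
# shapiro.test() refuses samples larger than 5000, so test a subsample;
# note that with n this large, even tiny departures from normality
# tend to produce very small p-values
set.seed(1)   # for a reproducible subsample
res_sub <- sample(residuals(fittedlm), 5000)
shapiro.test(res_sub)
```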

The same analysis is repeated below on the log-transformed model, extracting its residuals.

# log-transformed model
library(e1071)
res <- residuals(fittedlm_log)

car::qqPlot(res, main = NA, pch = 19, col = 2, cex = 0.7)

## [1] 12778  4025
skew <- skewness(res)
cat("skewness of the residuals -> ", skew)
## skewness of the residuals ->  0.02661918
# scale the residuals by the residual standard error to get approximate studentized residuals
sigma_residuals <- sigma(fittedlm_log)
studentized_residuals <- res / sigma_residuals

plot(fitted(fittedlm_log), studentized_residuals, xlab = "Fitted Values", ylab = "Studentized Residuals", 
     main = "Studentized Residuals vs. Fitted Values")
abline(h = 0, col = "red")

The normal Q-Q plot still shows deviation from the reference line in the tails, so the residuals are not fully normal, though far closer than before the transformation (the skewness dropped from 2.82 to 0.03). In the studentized-residuals-vs-fitted-values plot, most points lie within the range -2 to 2, suggesting a reasonable fit, with a few outliers beyond that range. The normal Q-Q plot of the studentized residuals below shows the same tail deviation, so the normality and variance assumptions are only approximately satisfied.

car::qqPlot(studentized_residuals, main = NA, pch = 19, col = 2, cex = 0.7)

## [1] 12778  4025

1.f.

The following chart plots actual against predicted values for the log-transformed model to identify its outliers.

plot(log(house_data$price), predict(fittedlm_log,newdata = house_data), 
     col=4, cex=0.3, xlab="Actual", ylab="Predicted", axes=FALSE)

extpts <- which(abs(residuals(fittedlm_log)) > 3*sd(residuals(fittedlm_log)))
text(log(house_data$price)[extpts], 
     predict(fittedlm_log, newdata = house_data)[extpts],
     labels = extpts, cex=0.5, col=2)
axis(1); axis(2); grid(); abline(0,1, col=4, lwd=3)

The code above flags observations whose residuals exceed three times the residual standard deviation; below we remove them from the data frame and refit the model on the reduced data set.

house_data.2 <- house_data[-extpts,]
mod.2 <- lm(log(price) ~ sqft_living, data = house_data.2)
summary(mod.2)
## 
## Call:
## lm(formula = log(price) ~ sqft_living, data = house_data.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12266 -0.28462  0.01431  0.26038  1.13015 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.221e+01  6.351e-03  1923.0   <2e-16 ***
## sqft_living 4.010e-04  2.800e-06   143.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3739 on 21559 degrees of freedom
## Multiple R-squared:  0.4875, Adjusted R-squared:  0.4875 
## F-statistic: 2.051e+04 on 1 and 21559 DF,  p-value: < 2.2e-16

From these results, the intercept is 1.221e+01 and the slope is 4.010e-04, both with p-values below 2e-16. The multiple R-squared of 0.4875 means sqft_living explains about 48.75% of the variance in log(price) after removing the outliers; the adjusted R-squared is the same. The F-statistic of 2.051e+04 with a p-value below 2.2e-16 indicates the model is statistically significant, so we reject the null hypothesis that the slope equals 0.

Removing the outliers increases the multiple R-squared by only 0.0040 (from 0.4835 to 0.4875), so the model barely changes.

par(mfrow=c(2,2))
plot(mod.2)

plot(log(house_data.2$price), predict(mod.2,newdata = house_data.2), 
     col=4, cex=0.3, xlab="Actual", ylab="Predicted", axes=FALSE)
axis(1); axis(2); grid(); abline(0,1, col=4, lwd=3)

The chart above confirms that the flagged outliers have been eliminated.

1.g.

fittedlm_log <- lm(log(house_data$price) ~ house_data$sqft_living, data = house_data)

summary(fittedlm_log)
## 
## Call:
## lm(formula = log(house_data$price) ~ house_data$sqft_living, 
##     data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.97781 -0.28543  0.01472  0.26070  1.27628 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.222e+01  6.374e-03  1916.9   <2e-16 ***
## house_data$sqft_living 3.987e-04  2.803e-06   142.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3785 on 21611 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4835 
## F-statistic: 2.023e+04 on 1 and 21611 DF,  p-value: < 2.2e-16

For each coefficient, the null hypothesis is that it equals 0 and the alternative is that it differs from 0; a t-test compares each estimated coefficient to 0. Since both p-values are below 0.05, we reject the null hypothesis for the intercept and for the sqft_living slope. The 95% confidence intervals for the coefficients are:

confint(fittedlm_log)["(Intercept)", ]
##    2.5 %   97.5 % 
## 12.20597 12.23096
confint(fittedlm_log)["house_data$sqft_living", ]
##        2.5 %       97.5 % 
## 0.0003932515 0.0004042416

1.h.

new_data <- data.frame(sqft_living = c(1500))  

# Predict the response variable based on the new data
predicted_price <- exp(predict(mod.2, newdata = new_data,  interval = "confidence", level = 0.95))
predicted_price
##      fit      lwr      upr
## 1 367676 365508.4 369856.5
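Note that interval = "confidence" bounds the *mean* price of 1,500-sqft houses; to bound the price of a single such house we would use a prediction interval, which is much wider. A sketch reusing the mod.2 object from part f:

```r
# a prediction interval adds the residual variation of an individual
# house on top of the uncertainty in the mean, so it is wider than
# the confidence interval reported above
new_house <- data.frame(sqft_living = 1500)
exp(predict(mod.2, newdata = new_house, interval = "prediction", level = 0.95))
```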

2.

Duncan is a data set containing information about various professions, classified by type as bc (blue-collar), wc (white-collar), and prof (professional), along with their income, education, and prestige values. From the structure of the data, the type variable is a factor with three levels and the rest are integers.

library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
data("Duncan")
str(Duncan)
## 'data.frame':    45 obs. of  4 variables:
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 3 2 ...
##  $ income   : int  62 72 75 55 64 21 64 80 67 72 ...
##  $ education: int  86 76 92 90 86 84 93 100 87 86 ...
##  $ prestige : int  82 83 90 76 90 87 93 90 52 88 ...

2.a.

education_color <- "mediumpurple4"
income_color <- "darkgreen"

ggplot(data = Duncan, aes(x = education, y = prestige, color = "Education")) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = education_color) +
  labs(title = "Education vs. Prestige", x = "Education", y = "Prestige") +
  scale_color_manual(values = c("Education" = education_color)) 
## `geom_smooth()` using formula = 'y ~ x'

# Create a regression plot for income vs. prestige with custom color
ggplot(data = Duncan, aes(x = income, y = prestige, color = "Income")) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = income_color) +
  labs(title = "Income vs. Prestige", x = "Income", y = "Prestige") +
  scale_color_manual(values = c("Income" = income_color))
## `geom_smooth()` using formula = 'y ~ x'

I have added two plots: one of income against prestige and one of education against prestige. The education-vs-prestige plot shows a linear relationship, with the plotted points scattered around the fitted straight line; even though certain points sit a little farther from the line, the pattern is consistent with an approximately linear trend and roughly normal data.

The income-vs-prestige plot also shows a linear relationship, plus an outlier with high prestige despite low income. We can test for outliers and remove them if they would mislead the fit, whereas high-leverage points should not be removed automatically, because they exert a strong effect on the fitted model and deserve individual scrutiny.
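One way to make the outlier-versus-leverage distinction concrete (a hedged sketch using standard regression diagnostics, not part of the original analysis) is to compute studentized residuals, hat values, and Cook's distance for a simple fit:

```r
library(car)                      # provides the Duncan data
data("Duncan")
fit <- lm(prestige ~ income + education, data = Duncan)
diag_tab <- data.frame(
  rstudent = rstudent(fit),       # outlyingness in the response
  hat      = hatvalues(fit),      # leverage in the predictor space
  cooks    = cooks.distance(fit)  # combined influence on the fit
)
head(diag_tab[order(-diag_tab$cooks), ], 3)  # most influential cases
```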

2.b.

mod.type <- lm(Duncan$prestige ~ Duncan$type, data = Duncan)
summary(mod.type)
## 
## Call:
## lm(formula = Duncan$prestige ~ Duncan$type, data = Duncan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.444 -11.762   1.238   9.556  44.238 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.762      3.466   6.567 6.08e-08 ***
## Duncan$typeprof   57.683      5.102  11.305 2.54e-14 ***
## Duncan$typewc     13.905      7.353   1.891   0.0655 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.88 on 42 degrees of freedom
## Multiple R-squared:  0.7574, Adjusted R-squared:  0.7459 
## F-statistic: 65.57 on 2 and 42 DF,  p-value: 1.207e-13

In this model bc is the baseline level, so the intercept (22.762) is the mean prestige for bc occupations and the other coefficients are differences from bc: prof occupations average 57.683 prestige points more than bc (highly significant), and wc occupations average 13.905 points more (only marginally significant, p = 0.0655), with the wc estimate less precise partly because there are fewer wc observations. We can conclude that the prof occupations enjoy the most prestige.

mod.type <- lm(Duncan$income ~ Duncan$type, data = Duncan)
summary(mod.type)
## 
## Call:
## lm(formula = Duncan$income ~ Duncan$type, data = Duncan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.056 -12.056  -2.762  12.238  57.238 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       23.762      3.834   6.197 2.07e-07 ***
## Duncan$typeprof   36.294      5.644   6.430 9.55e-08 ***
## Duncan$typewc     26.905      8.134   3.308  0.00194 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.57 on 42 degrees of freedom
## Multiple R-squared:  0.5064, Adjusted R-squared:  0.4829 
## F-statistic: 21.54 on 2 and 42 DF,  p-value: 3.642e-07

Again bc is the baseline level, so the intercept (23.762) is the mean income for bc occupations: prof occupations average 36.294 income units more than bc and wc occupations average 26.905 units more, both statistically significant. We can conclude that the prof occupations enjoy the highest income.
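Because bc is the default baseline, the summaries above never show a direct prof-versus-wc comparison. A hedged sketch (using a working copy so the original Duncan data frame is untouched) relevels the factor so prof becomes the baseline:

```r
library(car)                       # provides the Duncan data
data("Duncan")
dunc <- Duncan                     # working copy of the data frame
dunc$type <- relevel(dunc$type, ref = "prof")
mod.relevel <- lm(income ~ type, data = dunc)
summary(mod.relevel)$coefficients  # typebc/typewc are now differences from prof
```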

2.c.

library("MASS")
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
hist(Duncan$prestige, breaks=30, main="", xlab="prestige")

skewness(Duncan$prestige)
## [1] 0.1368965
mod.1 <- lm(Duncan$prestige ~., data = Duncan)
summary(mod.1)
## 
## Call:
## lm(formula = Duncan$prestige ~ ., data = Duncan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.890  -5.740  -1.754   5.442  28.972 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.18503    3.71377  -0.050  0.96051    
## typeprof     16.65751    6.99301   2.382  0.02206 *  
## typewc      -14.66113    6.10877  -2.400  0.02114 *  
## income        0.59755    0.08936   6.687 5.12e-08 ***
## education     0.34532    0.11361   3.040  0.00416 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.744 on 40 degrees of freedom
## Multiple R-squared:  0.9131, Adjusted R-squared:  0.9044 
## F-statistic:   105 on 4 and 40 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod.1)

From the model summary, the intercept (-0.185) is the expected prestige for the baseline type (bc) when income and education are both 0, and it is not significantly different from zero (p = 0.96). Holding the other predictors fixed, prof occupations average 16.66 prestige points more than bc and wc occupations 14.66 points less, while each additional unit of income and education adds about 0.60 and 0.35 prestige points respectively. These four coefficients all have p-values below the 0.05 significance level, so we reject the null hypothesis that they equal 0 and conclude that the predictors are related to the response. The high p-value associated with the intercept simply indicates there is no strong evidence that the expected prestige at this baseline point differs from zero.

The multiple R-squared is an important parameter to consider because it quantifies the proportion of variability in the response variable explained by the predictors: its value of 0.9131 means 91.31% of the variance in prestige is explained by the predictor variables we fit, with the rest unexplained. The adjusted R-squared of 0.9044, which penalizes R-squared for the number of predictors, likewise indicates a good fit.

The residual standard error of 9.744 on 40 degrees of freedom means the actual prestige values differ from the model's predictions by roughly 9.7 points on average. The F-statistic tests the overall significance of the model; its very low p-value (< 2.2e-16) indicates that the model as a whole is statistically significant.

2.d.

mod.1 <- lm(Duncan$prestige ~ education * income * type , data = Duncan)
summary(mod.1)
## 
## Call:
## lm(formula = Duncan$prestige ~ education * income * type, data = Duncan)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6819  -4.6186  -0.4928   3.1740  22.3316 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                17.89437   12.36638   1.447   0.1573  
## education                  -0.55965    0.50082  -1.117   0.2719  
## income                     -0.29036    0.53576  -0.542   0.5915  
## typeprof                  -85.48040   58.97604  -1.449   0.1567  
## typewc                    -86.67731   58.71481  -1.476   0.1494  
## education:income            0.03918    0.01903   2.058   0.0475 *
## education:typeprof          2.03495    0.84976   2.395   0.0225 *
## education:typewc            1.97504    1.07925   1.830   0.0763 .
## income:typeprof             2.49297    1.19344   2.089   0.0445 *
## income:typewc               1.59260    1.00829   1.580   0.1238  
## education:income:typeprof  -0.06008    0.02269  -2.648   0.0123 *
## education:income:typewc    -0.05461    0.02395  -2.280   0.0292 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.015 on 33 degrees of freedom
## Multiple R-squared:  0.9386, Adjusted R-squared:  0.9182 
## F-statistic: 45.87 on 11 and 33 DF,  p-value: < 2.2e-16

In this fully interacted model bc is the baseline type, and none of the main effects is individually significant. The education:income interaction is positive and significant (0.0392, p = 0.048), meaning the effect of education on prestige grows slightly with income. The education:typeprof (2.035) and income:typeprof (2.493) interactions are also positive and significant, so education and income pay off more, in prestige terms, for prof occupations than for bc.

The three-way interactions education:income:typeprof (-0.0601) and education:income:typewc (-0.0546) are negative and significant, which dampens those combined effects for prof and wc occupations. With an adjusted R-squared of 0.9182, the interaction model fits somewhat better than the additive one.

2.e.

library(e1071)
res <- residuals(mod.1)

car::qqPlot(res, main = NA, pch = 19, col = 2, cex = 0.7)

## machinist  minister 
##        28         6
skew <- skewness(res)
cat("skewness of the residuals -> ", skew)
## skewness of the residuals ->  0.7121299
boxplot(res, main = "Box Plot of Residuals", ylab = "Residuals")

#we need to calcualte the studentised residuals to get the intervals 
sigma_residuals <- sigma(mod.1)
studentized_residuals <- res / sigma_residuals


plot(fitted(mod.1), studentized_residuals, xlab = "Fitted Values", ylab = "Studentized Residuals", 
     main = "Studentized Residuals vs. Fitted Values")
abline(h = 0, col = "red")

From the output we can see that the residuals follow normality except for two data points, minister and machinist, which are outliers. Because the model was fit with these outliers included, they could have distorted it. We need to decide whether to remove them, ideally after consulting the domain teams, and we should also refit the model without the outliers and interpret the results.
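As a hedged follow-up (not part of the original assignment output), we can refit the interaction model from part d without the two flagged professions and compare the fit:

```r
library(car)                        # provides the Duncan data
data("Duncan")
keep <- !rownames(Duncan) %in% c("minister", "machinist")
mod.noout <- lm(prestige ~ education * income * type, data = Duncan[keep, ])
summary(mod.noout)$adj.r.squared    # compare with 0.9182 from the full data
```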

2.f.

mod.edu_indcom <- lm(Duncan$prestige ~ Duncan$education + Duncan$income , data = Duncan)
summary(mod.edu_indcom)
## 
## Call:
## lm(formula = Duncan$prestige ~ Duncan$education + Duncan$income, 
##     data = Duncan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.538  -6.417   0.655   6.605  34.641 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.06466    4.27194  -1.420    0.163    
## Duncan$education  0.54583    0.09825   5.555 1.73e-06 ***
## Duncan$income     0.59873    0.11967   5.003 1.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 42 degrees of freedom
## Multiple R-squared:  0.8282, Adjusted R-squared:   0.82 
## F-statistic: 101.2 on 2 and 42 DF,  p-value: < 2.2e-16

From the result above, the p-value below 2.2e-16 indicates rejection of the null hypothesis that the coefficients equal zero. The model explains about 82% of the variability (R-squared = 0.8282), which is a good fit, and both education and income are positively related to prestige and statistically significant.

res <- residuals(mod.edu_indcom)

car::qqPlot(res, main = NA, pch = 19, col = 2, cex = 0.7)

## minister reporter 
##        6        9
skew <- skewness(res)
cat("skewness of the residuals -> ", skew)
## skewness of the residuals ->  0.1497963
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.98254, p-value = 0.7234
boxplot(res, main = "Box Plot of Residuals", ylab = "Residuals")

# scale the residuals by the residual standard error of this model
# (mod.edu_indcom, not mod.1) to get approximate studentized residuals
sigma_residuals <- sigma(mod.edu_indcom)
studentized_residuals <- res / sigma_residuals


plot(fitted(mod.edu_indcom), studentized_residuals, xlab = "Fitted Values", ylab = "Studentized Residuals", 
     main = "Studentized Residuals vs. Fitted Values")
abline(h = 0, col = "red")

From the plots above, the residuals are approximately normal (the Shapiro-Wilk p-value of 0.72 does not reject normality), so the model is a reasonable fit; however, there are outliers that could still mislead it.

plot(jitter(mod.edu_indcom$fitted.values), rstudent(mod.edu_indcom), pch=19,
     col=ifelse(Duncan$prestige == 0, 3, 4),cex=0.3, xlab="",
     ylab="Studentized residuals", axes=F)
abline(h=0,col="grey", lwd=2)
axis(1);axis(2)

The chart above suggests the model is reasonable. However, the magnitude of some residuals is larger than we would like: we expect the studentized residuals to fall between -3 and 3, but a few lie outside this range.

2.g.

The following are added-variable (partial regression) plots, which show the relationship between each predictor and the response after holding the other predictors constant.

avPlots(mod.edu_indcom)

The following component-plus-residual (partial residual) plots for education and income show the relationship between each of these variables and its partial residuals from the full linear regression model that includes both predictors.

crPlots(mod.edu_indcom)

2.h.

In a linear regression model, effect size is often summarized by the R-squared value, which quantifies the proportion of variability in the response variable explained by the predictors; the higher the value on its 0-to-1 range, the better the fit. For the model in part f, R-squared is 0.8282, meaning about 83% of the variance in prestige is explained. However, R-squared alone does not fully characterize the effect size.
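One standard way to report an effect size beyond R-squared (a sketch; Cohen's f-squared is not part of the original output) is:

```r
# Cohen's f^2 converts R^2 into an effect-size measure for regression:
# f^2 = R^2 / (1 - R^2); values above 0.35 are conventionally "large"
r2 <- 0.8282                 # multiple R-squared from the model in 2.f
f2 <- r2 / (1 - r2)
round(f2, 2)                 # about 4.82, a very large effect
```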